Leveraging web resources for keyword assignment to short text documents
Abstract
Assigning relevant keywords to documents is very important for efficient retrieval, clustering and management of the documents. Especially with the web corpus deluged with digital documents, automation of this task is of prime importance. Keyword assignment is a broad topic of research which refers to tagging of documents with keywords, key-phrases or topics. For text documents, keyword assignment techniques have been developed under two sub-topics: automatic keyword extraction (AKE) and automatic key-phrase abstraction. However, the approaches developed in the literature for full-text documents cannot be used to assign keywords to low text content documents like Twitter feeds, news clips, product reviews or even short scholarly text. In this work, we point out several practical challenges encountered in tagging such low text content documents. As a solution to these challenges, we show that the proposed approaches, which leverage knowledge from several open source web resources, enhance the quality of the tags (keywords) assigned to the low text content documents. The performance of the proposed approach is tested on a real-world corpus consisting of scholarly documents with text content ranging from only the text in the title of the document (5-10 words) to the summary text/abstract (100-150 words). We find that the proposed approach not only improves the accuracy of keyword assignment but also offers a computationally efficient solution which can be used in real-world applications.
Journal: CoRR
Volume: abs/1706.05985
Publication date: 2017